ABSTRACT

The problem of clustering individuals is considered within the context of
a multivariate normal mixture using model-selection criteria. Often, the
number k of components in the mixture is not known. In practical problems,
the question arises as to the appropriate choice of k. The problem is to
decide how many components are in the mixture, a difficult multiple decision
problem.

In the statistical literature, several criteria of the hypothesis-testing
variety have been proposed for this purpose. However, all these criteria
suffer from sampling-distribution problems: the null distribution of the
criterion when the data actually contain k clusters is not known, and the
question remains largely unresolved.

Two well-known model-selection criteria, namely Akaike's Information
Criterion (AIC) and Schwarz's Criterion (SC), are proposed for the first time
as two new approaches to the problem of choosing the appropriate k in the
multinormal mixture model. The forms of these two model-selection
criteria are obtained in the standard multivariate normal mixture model.
Analyses are carried out on the same data set by applying the model-selection
criteria for different choices of k, using the mixture algorithm under two
assumptions, namely common covariance matrices among the component normals and
varying covariance matrices, to determine the appropriate number of
types or clusters. Results are obtained when the data are initially
partitioned into equal-size groups; when the data are initially reordered;
when the data are initialized by the k-means algorithm; when the data are
initialized by a special initialization scheme; and when the special
initialization scheme is used on reordered data.

Key Words and Phrases: Standard multivariate normal mixture model; Akaike's
Information Criterion (AIC); Schwarz's Criterion (SC).

1.  Introduction

What  is  the  most  appropriate  number  of  clusters  for  a  set  of  data?  How 
do  we  decide  the  number  of  clusters  present  in  the  data?  Which  cluster  or 
clusters  do  we  choose?  These  are  some  fundamental  questions  confronting 
practitioners  and  research  workers  in  classification  and  clustering.  The 
importance  and  the  difficulty  of  this  problem  have  been  noted  by  many  authors 
such  as  Beale  (1969),  Marriott  (1971),  Calinski  and  Harabasz  (1974),  Maronna 
and Jacovkis (1974), Matusita and Ohsumi (1980), and others. For a good
discussion of some of the test procedures used in deciding and determining the
number of clusters, we refer the reader to Milligan (1981), Dubes and Jain
(1979), and Everitt (1974, 1979).

It is reasonable for an investigator to ask whether there is any
structure in the data, or whether the data indicate just a single cluster or
group. If there is only one group, that is, no cluster structure, then most
investigators would decide that clustering techniques were not needed.
Discovering the structure in the data has its own practical importance. For
example, in studying medical and psychological syndromes; in processing
remotely sensed data for target identification or for predicting crop yields;
in problems of taxonomy; and in many other applications, we might want to find
out whether the observations fall into natural groups or not. If they do, then
we might want to discover how many groups or clusters there are, and how to
identify and interpret them.

In the literature, many investigators have attempted to devise reasonable
indicators for choosing the number of clusters present and for identifying
and interpreting clustering results. Even today, however, there is no
satisfactory solution and no unified, flexible approach. The major difficulty
in deriving formal significance tests for cluster analysis, similar to the
ordinary "t" and "F" tests, appears to be that of determining the sampling
distribution of the proposed test statistic [Everitt (1979)]. The problem of
deriving a sampling distribution is formidable, and the choice of a fixed
level of significance for comparing different numbers of clusters with varying
numbers of parameters is inappropriate, since it does not take into account
the increase in the variability of the estimates when the number of parameters
is increased. Therefore, the theoretical difficulties faced in deriving the
sampling distribution of a proposed test statistic, in the context of cluster
analysis, are rather involved, and the approach is not practical. This point
has been made by Gnanadesikan and Wilk (1969) and others in the literature.

This suggests that, if we use formal significance-test-type indicators
or statistics in conjunction with clustering algorithms or techniques,
then we must devise a criterion (or criteria) that combines the
estimation problem and the testing problem to decide on the number of
clusters present in a data set.

Therefore, in this paper we shall propose and establish two theoretically
based procedures for deciding and determining the number of clusters present
and for identifying the best clustering alternative or alternatives. We shall
achieve this by introducing two well-known model-selection criteria, namely,
Akaike's Information Criterion (AIC) and its derivative, Schwarz' Criterion
(SC), as two new and unifying procedures.

Thus, the main focus of this paper will be to show how to use these
two model-selection criteria in deciding and determining the number of
component clusters present in the standard multivariate normal mixture model
without knowing an a priori classification of the observations.

In Section 2, we shall discuss the standard multivariate normal mixture
model and the clustering criterion used under this model, namely, the maximum
likelihood approach. In Section 3, we shall review the use of
fitting the mixture model to determine the number of component clusters, and
the corresponding unresolved problems. In Section 4, we shall present the two
model-selection criteria and list their important general characteristics. In
Section 5, we shall give the forms of the model-selection criteria to be used
in the standard normal mixture model approach to clustering. In Section 6, we
shall apply these two model-selection criteria in deciding and identifying the
number of components or clusters present in the Fisher iris data and present
the numerical results. Finally, in Section 7, we shall present conclusions and
discussion.

2.  The  Standard  Multivariate  Normal  Mixture  Model 

2.1.  The Model

As  has  been  suggested  before  [see,  e.g.,  Fleiss  and  Zubin  (1969)],  often 
when  we  consider  clustering  problems  it  seems  relevant  and  logical  to  consider 
the  sample  as  arising  from  several  different  populations  rather  than  a  single 
population  since  the  individuals  within  a  class  or  group  differ  from  one 
another.  That  is,  each  individual  in  the  sample  is  assumed  to  have  come  from 
one  of  several  populations  (types). 

Given a sample from the overall mixed population, or assuming that the
sample has come from a mixture population, the problem from a clustering
viewpoint is to determine and describe the number of subpopulations or groups,
the parameters of the distribution characterizing each subpopulation or group,
and the group to which each individual belongs.

Therefore, the problem of clustering individuals, objects, or cases, to be
considered here, will be studied within the context of a mixture of
multivariate normal distributions.

More specifically, we shall consider a multivariate normal mixture model,

(2.1.1)  $f(X) \equiv f(X; \pi, \mu, \Sigma) = \sum_{k=1}^{K} \pi_k f_k(X; \mu_k, \Sigma_k)$

where $\pi = (\pi_1, \pi_2, \ldots, \pi_{K-1})$ are the K - 1 independent mixing
proportions, which are such that

$0 < \pi_k < 1, \qquad \pi_K = 1 - \sum_{k=1}^{K-1} \pi_k$,

and where $f_k(X; \mu_k, \Sigma_k)$ is the k-th component multivariate normal
density, given by

(2.1.2)  $f_k(X; \mu_k, \Sigma_k) = (2\pi)^{-p/2} |\Sigma_k|^{-1/2} \exp\{-\tfrac{1}{2}(X - \mu_k)' \Sigma_k^{-1} (X - \mu_k)\}$.
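As a minimal illustration of (2.1.1) and (2.1.2), the sketch below evaluates the mixture density in the univariate special case p = 1, where each covariance matrix reduces to a single variance. The function names and the numerical values are hypothetical and chosen only for illustration; the K-th mixing proportion is recovered from the constraint that the proportions sum to one.

```python
import math


def normal_pdf(x, mu, var):
    """Univariate (p = 1) case of the component density (2.1.2)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(x, free_props, mus, variances):
    """Mixture density (2.1.1) for p = 1.

    free_props holds the K - 1 independent mixing proportions; the K-th
    proportion is obtained from the constraint that all K sum to one.
    """
    props = list(free_props) + [1.0 - sum(free_props)]
    if any(not 0.0 < pk < 1.0 for pk in props):
        raise ValueError("each mixing proportion must lie strictly in (0, 1)")
    return sum(pk * normal_pdf(x, mk, vk)
               for pk, mk, vk in zip(props, mus, variances))


# Hypothetical two-component example with proportions (0.3, 0.7),
# means (0, 4), and unit variances.
density_at_zero = mixture_pdf(0.0, [0.3], [0.0, 4.0], [1.0, 1.0])
```

Passing only the K - 1 free proportions mirrors the parameterization used in the text, in which the last proportion is not an independent parameter.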

The model given by the p.d.f. in (2.1.1) is called the standard
multivariate normal mixture model to distinguish it from the modified
conditional mixture model considered by Symons (1981), Sclove (1977, 1982),
Scott and Symons (1971), and John (1970).

In the statistical literature, several authors, including Wolfe (1970),
Day (1969), Binder (1978), Hartigan (1977), and others, have considered
clustering problems in which the standard mixture of multivariate normals
given by (2.1.1) is used as a statistical model.

2.2.  Clustering Criteria: The Maximum Likelihood Approach

Wolfe (1970) has considered clustering based on the standard normal
mixture model. He uses the maximum likelihood (ML) approach to estimate the
mixing proportions $\pi_k$, the mean vectors $\mu_k$, and the covariance
matrices $\Sigma_k$. The maximum likelihood estimators (MLE's) have well-known
desirable properties, and it is natural to consider the ML approach for
estimating the parameters in a mixture of multivariate normal distributions.
To estimate the parameters, the likelihood of the data is required, which is
given by

(2.2.1)  $L(X \mid \theta) = \prod_{i=1}^{n} \{ \sum_{k=1}^{K} \pi_k f_k(X_i; \mu_k, \Sigma_k) \}$,

or the log of the likelihood is

(2.2.2)  $\log_e L(X \mid \theta) = \sum_{i=1}^{n} \log_e \{ \sum_{k=1}^{K} \pi_k f_k(X_i; \mu_k, \Sigma_k) \}$.

It is the likelihood in (2.2.1) or the log likelihood in (2.2.2) that is
maximized with respect to
$\theta = (\pi_1, \ldots, \pi_{K-1}; \mu_1, \ldots, \mu_K; \Sigma_1, \ldots, \Sigma_K)$,
the vector of parameters, by Wolfe (1970) and Day (1969). The maximum
likelihood equations are obtained by equating the first partial derivatives of
(2.2.2) with respect to the $\pi_k$, the elements of each vector $\mu_k$, and
those of each matrix $\Sigma_k$ to zero. These equations are solved iteratively
by a modified Newton-Raphson method. The iterative MLE's are given by


(2.2.3)  $\hat{\pi}_k = \frac{1}{n} \sum_{i=1}^{n} \hat{P}(k \mid X_i)$,  k = 1, 2, ..., K-1

(2.2.4)  $\hat{\mu}_k = \frac{1}{n \hat{\pi}_k} \sum_{i=1}^{n} X_i \hat{P}(k \mid X_i)$,  k = 1, 2, ..., K

(2.2.5)  $\hat{\Sigma}_k = \frac{1}{n \hat{\pi}_k} \sum_{i=1}^{n} \hat{P}(k \mid X_i)(X_i - \hat{\mu}_k)(X_i - \hat{\mu}_k)'$,  k = 1, 2, ..., K

where

$\hat{\pi}_k$ = the mixing proportion for type or cluster k,
$\hat{\mu}_k$ = vector of means for cluster k,
$\hat{\Sigma}_k$ = covariance matrix for cluster k,
$X_i$ = vector of observations for the i-th point in the sample, and

$\hat{P}(k \mid X_i) = \frac{\hat{\pi}_k f_k(X_i; \hat{\mu}_k, \hat{\Sigma}_k)}{\sum_{j=1}^{K} \hat{\pi}_j f_j(X_i; \hat{\mu}_j, \hat{\Sigma}_j)}$
= posterior probability of group membership of $X_i$ in cluster k.


If  the  clusters  have  a  common  covariance  matrix,  then  we  use 


(2.2.6)  $\hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} \hat{P}(k \mid X_i)(X_i - \hat{\mu}_k)(X_i - \hat{\mu}_k)'$.

Since an iterative process is used to solve the equations,
several sets of values may satisfy them, and the results may depend on
the initial values for the iteration. Since mixture analysis attempts
to find maximum-likelihood estimates of the parameters, the best solution for
our purposes is the one with the greatest likelihood, or equivalently the
greatest log likelihood.
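The dependence on starting values can be explored with a short sketch. The code below iterates equations (2.2.3)-(2.2.5) as a simple fixed-point (EM-type) scheme in the univariate case p = 1. This is an assumption made for brevity and is not the modified Newton-Raphson program referred to above. The function names, the toy data, the starting values, and the small variance floor are all hypothetical. Among several starts, the solution with the greatest final log likelihood is retained.

```python
import math


def normal_pdf(x, mu, var):
    """Univariate normal density (p = 1 case of the component density)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(xs, props, mus, variances):
    """Log likelihood (2.2.2) for p = 1."""
    return sum(math.log(sum(p * normal_pdf(x, m, v)
                            for p, m, v in zip(props, mus, variances)))
               for x in xs)


def fit_mixture(xs, start_means, n_iter=200, var_floor=1e-6):
    """Iterate (2.2.3)-(2.2.5) from the given starting means.

    var_floor is an illustrative guard against a collapsing variance,
    not part of the equations in the text.
    """
    n, K = len(xs), len(start_means)
    mus = list(start_means)
    props = [1.0 / K] * K
    grand_mean = sum(xs) / n
    variances = [sum((x - grand_mean) ** 2 for x in xs) / n] * K
    history = []
    for _ in range(n_iter):
        # Posterior probabilities P(k | x_i) under the current estimates.
        post = []
        for x in xs:
            w = [p * normal_pdf(x, m, v) for p, m, v in zip(props, mus, variances)]
            total = sum(w)
            post.append([wk / total for wk in w])
        for k in range(K):
            nk = sum(row[k] for row in post)                            # n * pi_k
            props[k] = nk / n                                           # (2.2.3)
            mus[k] = sum(row[k] * x for row, x in zip(post, xs)) / nk   # (2.2.4)
            spread = sum(row[k] * (x - mus[k]) ** 2
                         for row, x in zip(post, xs)) / nk              # (2.2.5)
            variances[k] = max(var_floor, spread)
        history.append(log_likelihood(xs, props, mus, variances))
    return props, mus, variances, history


# Toy data with two well-separated groups and two hypothetical starts.
xs = [-2.1, -1.9, -2.0, -1.8, -2.2, 1.9, 2.1, 2.0, 2.2, 1.8]
fits = [fit_mixture(xs, start) for start in ([-1.0, 1.0], [-3.0, 0.0])]
best = max(fits, key=lambda fit: fit[3][-1])   # greatest final log likelihood
```

Recording the log likelihood after every pass makes it easy to confirm that it does not decrease from one iteration to the next, and to compare the end points reached from different starts.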

Once the MLE's are known, we can regard each distribution as indicating a
separate cluster, and individuals are then assigned by the Bayes allocation
rule. That is, assign $X_i$ to the k-th distribution when

(2.2.7)  $\hat{\pi}_l f_l(X_i; \hat{\mu}_l, \hat{\Sigma}_l) < \hat{\pi}_k f_k(X_i; \hat{\mu}_k, \hat{\Sigma}_k)$  for all $l \neq k$.

This process is repeated, increasing the log likelihood at each stage, until no
further reallocation of the X's occurs. Put another way, individual i is
assigned to that component (or group) k for which the estimated posterior
probability of group membership, $\hat{P}(k \mid X_i)$, is largest. Therefore,
for a particular individual i, the optimal $\{\hat{P}(k \mid X_i)\}$ will be
$\hat{P}(k \mid X_i) = 1$ when individual i is from component (or group) k and
zero otherwise.
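The allocation rule (2.2.7) amounts to taking, for each individual, the component with the largest estimated posterior probability. The sketch below assumes the univariate case p = 1 and uses hypothetical fitted values; the names `posterior` and `allocate` are illustrative only, and components are indexed from zero.

```python
import math


def normal_pdf(x, mu, var):
    """Univariate normal density (p = 1 case of the component density)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def posterior(x, props, mus, variances):
    """Estimated posterior probabilities P(k | x) for all components k."""
    weights = [p * normal_pdf(x, m, v) for p, m, v in zip(props, mus, variances)]
    total = sum(weights)
    return [w / total for w in weights]


def allocate(x, props, mus, variances):
    """Index of the component with the largest posterior probability."""
    post = posterior(x, props, mus, variances)
    return max(range(len(post)), key=post.__getitem__)


# Hypothetical fitted two-component mixture.
props, mus, variances = [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]
labels = [allocate(x, props, mus, variances) for x in (-1.5, 0.3, 2.4)]
```

Because the denominator of the posterior probability is the same for every component, comparing posteriors is equivalent to comparing the products of mixing proportion and component density in (2.2.7).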

It should be noted here that this is one of the points where the standard
normal mixture model considered here differs from the conditional
mixture model. In the conditional mixture model, individual i is
assigned to the group k for which the estimated density is largest, rather
than to the group for which the estimated posterior probability of group
membership is largest, as is the case in the standard normal mixture model.
For more details on this, refer to Sclove (1979).

3.  Fitting the Mixture Model to Determine the Number of Component Clusters: Unresolved Problems

As we mentioned in Section 1, we may want to ask whether there really is a
mixture or whether there is just a single underlying component cluster. In
practice, this is the sort of question we might be interested in, since
fitting the standard normal mixture model to determine the number of component
clusters has many practical uses. For example, we may want to
determine the number of disease types in the study of disease patterns, of
blood pressure types, and of psychiatric disorder types. In reliability
analysis, we may want to determine the number of laser types on the basis of
mean laser life. Lasers are employed in telephone communication systems in
which coherent laser light is used to transmit telephone communications. In
image processing, we may want to determine the number of classes of segments,
etc.

As was noted by Sokal (1977), the problems of inference on the number
of clusters "actually" present in a set of data, and of testing for model fit,
have not yet received much successful attention but are increasingly
recognized as important.

Thus, the standard mixture problem will be to decide how many component
clusters are in the mixture, a difficult multiple decision problem. A simpler
problem is to decide whether k=r or k=r+1 component clusters are necessary.
In practice, it is common to specify a larger hypothesized number of clusters,
say K, and create a sequence of k=1,2,...,K component clusters by using the
mixture algorithm.

In the literature, several methods have been proposed for determining the
number of component clusters when the technique of fitting the standard normal
mixture model is used. One type consists of informal graphical
techniques, and the other of more formal techniques of the hypothesis-testing
variety.

When the technique of fitting a mixture of distributions is used as a
clustering technique, the likelihood ratio test is a natural criterion for
testing the number of component clusters or groups. However,
as we shall see, it has its thorny problems.

Let $L_k$ denote the maximized likelihood for given k. Then

(3.1)  $\lambda = L_k / L_{k'}$

is the likelihood ratio statistic for testing k clusters against k' clusters
(k < k'). From a Monte Carlo investigation, Wolfe (1971) arrived at and
suggested an adjusted likelihood ratio test in which the statistic (not
counting the mixing proportions) is

(3.2)  $-\frac{2}{n}\,(n - 1 - p - \tfrac{k'}{2}) \log_e \lambda \sim \chi^2_f$  (chi-square)

with degrees of freedom f = 2p(k' - k), where

n = sample size,
p = number of variables,
k = number of component types in the null hypothesis,
k' = number of component types in the alternative hypothesis.
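The arithmetic of the adjusted statistic in (3.2) can be sketched as follows, taking the likelihood ratio to be the maximized likelihood under k components divided by that under k' components. The function name and the two maximized log likelihood values are hypothetical; the sizes n = 150 and p = 4 are those of the Fisher iris data and are used only to make the example concrete.

```python
def wolfe_statistic(loglik_null, loglik_alt, n, p, k_null, k_alt):
    """Adjusted likelihood ratio statistic (3.2) and its degrees of freedom.

    loglik_null and loglik_alt are maximized log likelihoods under k_null
    and k_alt components, so their difference is log_e(lambda).
    """
    if k_alt <= k_null:
        raise ValueError("the alternative must have more components (k < k')")
    log_lambda = loglik_null - loglik_alt
    statistic = -(2.0 / n) * (n - 1 - p - k_alt / 2.0) * log_lambda
    dof = 2 * p * (k_alt - k_null)
    return statistic, dof


# Hypothetical maximized log likelihoods for k = 1 against k' = 2.
stat, dof = wolfe_statistic(-400.0, -380.0, n=150, p=4, k_null=1, k_alt=2)
```

With these made-up inputs the statistic is 38.4 on 8 degrees of freedom; whether a chi-square reference distribution is appropriate for it is exactly the unresolved question discussed in this section.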

After performing a small-scale simulation study, Wolfe (1971) recommended, on
the basis of the results, using the modified likelihood ratio test given
in (3.2) for k=1 against k=2, when under the alternative hypothesis the two
components are assumed to have the same variance-covariance matrix. But
Wolfe's simulation results suggest that, even for reasonably large sample
sizes, the statistic in (3.2) does not appear to follow asymptotically the
usual chi-square distribution. In Wolfe's simulation, some of the sample means
and variances are quite different from those corresponding to the stipulated
chi-square distributions. Also, the same results may not hold when under the
alternative hypothesis the two components are assumed to have different
variance-covariance matrices. Moreover, it is important to note that in the
standard mixture problem, the likelihood function is a different function for
different values of k, where k=1,2,...,K. Therefore, in the context of the
standard mixture model, the question that arises, and that remains largely
unresolved, is what the asymptotic null distribution is if the data actually
contain k=1,2,...,K clusters.

Others in the statistical literature have also cited the distributional
problems of the likelihood ratio test statistic in the mixture problem. For
example, Hartigan (1977) speculated that the log likelihood ratio lies between
$\tfrac{1}{2}\chi^2_p$ and $\tfrac{1}{2}\chi^2_{p+1}$, where p is the number of
variables. Binder (1978), on the other hand, argues that the likelihood ratio
criterion given in (3.2) is not necessarily asymptotically chi-square
distributed, since the hypotheses are

(3.3)  $H_0: \pi = 1$
       $H_1: 0 < \pi < 1$.

Here, under the null hypothesis, $\pi$, the mixing proportion, is on the
boundary of the parameter space, and the likelihood ratio criterion takes the
value zero when $\hat{\pi}$, the maximum likelihood estimate of $\pi$, is 1,
which occurs with probability $\tfrac{1}{2}$; therefore, under the null
hypothesis, the likelihood ratio criterion cannot be asymptotically $\chi^2$.

Behboodian (1972) shows that as the component densities become closer
and closer to each other, the information matrix approaches a singular matrix
with some diagonal elements equal to zero. The same thing happens when the
mixing parameter $\pi$ tends to one or zero. Consequently, Behboodian concludes
that for estimating the parameters in a mixture whose two component clusters
are not well separated, or which has a mixing proportion close to zero, very
large samples may be needed. For example, for a fixed total sample size n, when
we run the mixture algorithm for a very large hypothesized number of clusters
K, the mixing proportion $\pi$ starts tending to zero. To put it another way,
as K, the desired total number of component clusters, gets larger and larger
for a fixed sample size n, the mixing proportion $\pi$ tends to zero, thus
causing the information matrix to be singular. For this reason, we need very
large samples to fit the standard normal mixture model, to ensure that the
component sample sizes are large enough that the information matrix will not
become singular.

This point raises another important question as to what the appropriate
hypothesized number of component clusters K should be for a fixed sample size
n when fitting the mixture model. In the literature, this point has not been
studied before; it certainly deserves more attention and will be a subject
of further study.

A rule of thumb, however, is to use $K \approx (n/2)^{1/2}$, suggested by
Mardia et al. (1979), where K is the total hypothesized number of component
clusters and n is the total sample size.
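As a small worked example of this rule of thumb, the sketch below evaluates it for n = 150, the size of the Fisher iris data. The function name is hypothetical, and how the result should be rounded is left open because the rule itself does not say.

```python
import math


def rule_of_thumb_K(n):
    """Rule-of-thumb upper limit K of roughly (n / 2) ** (1 / 2)."""
    return math.sqrt(n / 2.0)


# For n = 150 the rule gives the square root of 75, about 8.66.
k_upper = rule_of_thumb_K(150)
```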

In reviewing the literature further, we see that some simulation results
of Everitt (1981) show that the suggestion of Wolfe (1971) seems reasonable
only in cases where n > 10p, that is, where the sample size n is at least of
the order of 10p, p being the number of variables, for testing a one-component
standard normal mixture model against a two-component one when the two
components are assumed to have the same variance-covariance matrix. According
to Everitt's large-scale simulation results, Hartigan's (1977) conjecture does
not seem to be correct. However, at this point, Everitt's results cannot be
extended to testing two components against three, three against four, four
against five, and so forth, since there does not exist any reasonable Monte
Carlo validation of the significance testing procedure given in (3.2).

Utilizing established results in the literature on the distribution
of the log likelihood ratio test statistic when the true parameter is "near"
the boundaries of the hypothesis regions, we can reflect the key
distributional requirements of the model.

Following Feder (1968), we state that, when the data can be represented
by n independent random variables with identical distributions depending on
the parameters $(\theta_1, \theta_2, \ldots, \theta_m)$, then the limiting
distribution (as $n \to \infty$) of

(3.7)  $-2 \log_e(\text{likelihood ratio})$

is, under certain sequences of alternative hypotheses converging to the null
hypothesis, which appears to be the case in testing mixture models, a
noncentral chi-square distribution. This result is due also to Wald (1943).

According to this result, it seems that for the mixture problem the key
distributional requirement for a test is

(3.8)  $-2\,C(n,p,K) \log_e \lambda \sim \chi'^2_f(\delta)$ asymptotically  (noncentral chi-square)

where

f = number of degrees of freedom,
$\delta$ = noncentrality parameter,
$C(n,p,K) = \frac{1}{n}(n - 1 - p - \tfrac{K}{2})$ = correction factor,
n = sample size,
p = number of variables, and
K = total number of components hypothesized in the mixture model.

In the next section, Section 4, we shall introduce the two
well-known model-selection criteria to be used to estimate k (k=1,2,...,K), the
number of component clusters in the standard normal mixture model. First, some
general explanation of model-selection criteria is appropriate.


4.  Model-Selection Criteria


In the literature, model selection or identification problems continue
to attract a great deal of interest among statisticians and other scientists.
The major effort in this respect has been channeled towards simple criteria
for choosing one of a set of competing models to describe a given data set.
Much of this interest has been stimulated by the fundamental work of Akaike
(1973) and by the appearance of an information criterion due to him, known as
Akaike's Information Criterion (AIC). One group of criteria in the current
statistical literature is thus based on Boltzmann's (1877) entropy or
Kullback's (1959) information; Akaike's Information Criterion is an example.
The other main group of criteria is Bayesian. Among the Bayesian criteria,
we shall consider here only Schwarz' Criterion (SC).

Next, we give the formal definitions and some of the important
characteristics of these two model-selection criteria.

4.1.  Akaike's  Information  Criterion  (AIC) 

Suppose there are K alternative models $M_k$, k=1,2,...,K, represented by the
densities $f_1(\cdot \mid \theta_1), f_2(\cdot \mid \theta_2), \ldots, f_K(\cdot \mid \theta_K)$
for the explanation of a random vector X, and that n observations are given.
In 1971, Akaike first introduced an information criterion, which has become
known as Akaike's Information Criterion (AIC), for the identification and
comparison of statistical models among a class of competing models with
different numbers of parameters. It is defined by

(4.1.1)  $\mathrm{AIC}(k) = -2 \ln[\max L(k)] + 2m(k)$,

or, in words,

(4.1.2)  AIC = -2 ln(maximized likelihood)
               + 2 (number of parameters estimated within the model).

In (4.1.1), L(k) is the likelihood when $M_k$ is the model, max denotes its
maximum over the parameters, and m(k) is the number of independent parameters
when $M_k$ is the model.

The statistic AIC(k) was obtained by Akaike (1973, 1974) with the aid of
an information-theoretic interpretation of the method of maximum likelihood.
It is a natural estimate of minus twice the expected log likelihood of the
model whose parameters are determined by the method of maximum likelihood.

The negative expected log likelihood is, except for an additive constant,
identical to the (generalized) entropy, or the "cross-entropy," which is a
measure of goodness of fit or closeness of the estimated, fitted, or
predictive model to the true model. From this point of view, when several
competing models are being compared or fitted, AIC(k) is a simple procedure
which measures the badness of fit or the discrepancy of the estimated model
from the true model when a set of data is given. The model chosen is the one
that minimizes AIC; this is called the minimum AIC procedure. The first term in
(4.1.1) stands for the penalty for badness of fit when the maximum likelihood
estimators of the parameters of the model are used. The first term is also
known as the measure of inaccuracy [see, e.g., Stone (1982)]. The second term
in the definition of AIC, on the other hand, is interpreted as representing a
penalty that should be paid for increasing the number of parameters, or as
compensation for the bias or increased unreliability in the first term due to
the increased number of parameters. The second term in AIC is also known as
the complexity of the selected model. If more parameters are used to describe
the data, it is natural to get a larger likelihood, possibly without improving
the goodness of fit. AIC avoids this spurious improvement of fit by
penalizing the use of additional parameters. In this sense, AIC may be
regarded as an explicit formulation of the principle of parsimony in model
building. In the statistical literature, the interpretation of the second term
in AIC as a measure of the complexity of the model $M_k$ corresponds to the
principle known as Occam's Razor, which emphasizes the desirability of
selecting accurate and parsimonious models of reality. This principle is also
closely related to the principle in hypothesis testing which emphasizes the
desirability of considering "substantive" significance as opposed to
statistical significance. For more details on this, we refer the reader to
Hodges and Lehmann (1954).
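The minimum AIC procedure described above can be sketched in a few lines. The function names, the candidate labels, the maximized log likelihoods, and the parameter counts below are all hypothetical and serve only to show the mechanics of (4.1.1).

```python
def aic(max_loglik, n_params):
    """AIC as in (4.1.1): -2 ln(max L) + 2 m."""
    return -2.0 * max_loglik + 2.0 * n_params


def minimum_aic(candidates):
    """Pick the model with the smallest AIC.

    candidates maps a model label k to a pair
    (maximized log likelihood, number of independent parameters m(k)).
    """
    scores = {k: aic(loglik, m) for k, (loglik, m) in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Made-up values for three competing models.
candidates = {1: (-410.0, 14), 2: (-380.0, 29), 3: (-370.0, 44)}
best_k, scores = minimum_aic(candidates)
```

In this made-up example the third model has the largest likelihood, but its extra parameters cost more than the gain in fit, so the second model attains the minimum AIC.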

We now list some of the important characteristics of Akaike's Information
Criterion (AIC) as follows:

(i)  AIC is defined without specific reference to the true model $f^*$.
Thus, for any finite number of parametric models, we may always
consider an extended model that will play the role of $f^*$. This
suggests that AIC can be useful for the comparison of models which
are nonnested, i.e., the situation where the conventional log likelihood
ratio test is not applicable, as mentioned by Akaike (1982).

(ii)  The value of AIC decreases quickly as the number of parameters being
adjusted is increased and then increases almost linearly when too
many redundant parameters are included in the model. For more on
this, refer to Akaike (1978) and Smith and Spiegelhalter (1980).

(iii) According to AIC, inclusion of an additional parameter is appropriate
if ln[max L] increases by one unit or more, i.e., if max L increases
by a factor of e or more.

(iv) AIC can have positive or negative values depending on the situation.
If we let $\hat{L}(k) = \max L(k)$ when $M_k$ is the model with, say, k
parameters, and $\hat{L}(k+1) = \max L(k+1)$ when $M_{k+1}$ is the model
with, say, k+1 parameters, and if $\hat{L}(k+1)/\hat{L}(k) > e$, then
AIC(k) is positive. If $\hat{L}(k+1)/\hat{L}(k) < e$, then AIC(k) is
negative.

(v) AIC does not require a level of significance or table look-up.

(vi) The relationship between the AIC and the conventional likelihood
ratio test statistic can be written as

$-2 \ln \lambda(H_0; H_1) = \mathrm{AIC}(H_0) - \mathrm{AIC}(H_1) + 2k$,

where the model $H_1$ contains the model $H_0$ as a restricted family of
distributions, and k denotes the degrees of freedom of the chi-square
distribution of the likelihood ratio test statistic.
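
Characteristic (iii) can be checked in one step. The following is a minimal worked sketch, assuming the usual form AIC(k) = -2 ln[max L(k)] + 2m(k), that the larger model has exactly one more parameter, and writing $\hat{L}(k)$ for max L(k):

```latex
\begin{aligned}
\mathrm{AIC}(k) - \mathrm{AIC}(k+1)
  &= \bigl[-2\ln \hat{L}(k) + 2m\bigr] - \bigl[-2\ln \hat{L}(k+1) + 2(m+1)\bigr] \\
  &= 2\ln\frac{\hat{L}(k+1)}{\hat{L}(k)} - 2 .
\end{aligned}
```

Under these assumptions the difference is positive, so that the larger model attains the smaller AIC, exactly when $\hat{L}(k+1)/\hat{L}(k) > e$, that is, when ln[max L] increases by more than one unit.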

4.2. Schwarz' Criterion (SC)

Schwarz  (1978)  proposed  a  model  selection  procedure  which  minimizes  the 
criterion, 

(4.2.1)  SC(k)  =  -2  ln[max  L(k) ]  +  m(k)ln(n), 

where n is the number of independent observations. This criterion is obtained
by analyzing the behavior of the posterior probability of the model when
n grows to infinity under the assumption of some arbitrary positive a priori
probability distributions on the parameters. Therefore, this criterion is a
Bayesian criterion. For this reason, we shall abbreviate it as SC, instead of
SIC. One should note that SC and AIC are qualitatively the same; they differ
quantitatively only in that the number of estimated parameters is multiplied
by ln(n), the natural logarithm of the sample size, rather than by 2.

We  now  list  some  of  the  important  characteristics  of  Schwarz'  Criterion 
(SC)  as  follows: 

(i)  SC  assumes  a  fixed  penalty  for  guessing  the  wrong  model. 

(ii) For small sample sizes, SC favors lower-dimensional models as
compared to AIC. However, depending on the nature of the priors on
the parameters and the nature of the model fitted, Schwarz'
approximation may fail in small samples. Nevertheless, for large sample
sizes it has its own advantages.

(iii) According to Schwarz' Criterion (SC), an additional parameter will be
included if it increases ln[max L] by an amount ln(n)/2, that is, if
max L increases by a factor of $\sqrt{n}$ or more.

(iv) Like AIC, SC can also have positive or negative values depending on
the situation. That is, if $\hat{L}(k+1)/\hat{L}(k) > \sqrt{n}$, then
SC(k) is positive. On the other hand, if
$\hat{L}(k+1)/\hat{L}(k) < \sqrt{n}$, then SC(k) is negative.

(v) SC also does not require a level of significance or table look-up.
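
The two inclusion rules can be compared numerically. The sketch below is only illustrative: the function name `loglik_gain_needed` is hypothetical, and it assumes the per-parameter penalties are 2 for AIC and ln(n) for SC, as in the definitions above.

```python
import math

def loglik_gain_needed(n, criterion):
    """Smallest increase in ln[max L] that justifies one extra parameter."""
    if criterion == "AIC":
        return 1.0                  # penalty 2 per parameter, halved on the ln L scale
    if criterion == "SC":
        return math.log(n) / 2.0    # penalty ln(n) per parameter, halved
    raise ValueError("criterion must be 'AIC' or 'SC'")

n = 150  # sample size of the iris data used later in the text
aic_gain = loglik_gain_needed(n, "AIC")
sc_gain = loglik_gain_needed(n, "SC")
print("AIC: ln L must rise by", aic_gain, "(factor", round(math.exp(aic_gain), 3), ")")
print("SC : ln L must rise by", round(sc_gain, 3), "(factor", round(math.exp(sc_gain), 3), ")")
```

For n = 150 the SC threshold on the likelihood ratio is sqrt(150), noticeably stricter than the factor e required by AIC, which is one way to see why SC tends toward lower-dimensional models.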

5. The Forms of Model-Selection Criteria in Standard Normal Mixture Model

Despite the recent development of the use of statistical methodology and
models in many disciplines, it seems that in many situations the difficulty of
constructing an adequate model based on the available sample information is
not fully recognized. Cluster analysis is a case in point.

Recall that k denotes the number of clusters or component clusters.
Usually k is permitted to vary: k=1,2,...,K, say. Each choice of k
corresponds to a different model for the data. One has to estimate the
parameters, say $\theta$, of this model. Then one computes the likelihoods
L(k), k=1,2,...,K, and is faced with the problem of comparing them. That is,
in classification and clustering we have the problems of identifying and
discovering the number of clusters present in the standard mixture model,
without any a priori information about the data.

Such  problems  of  statistical  model  identification  suggest  the  introduc¬ 
tion  and  the  application  of  practically  useful  and  versatile,  and  yet  theore¬ 
tically  sound  criteria  of  "fit"  of  models  such  as  the  ones  we  discussed  in 
Section  4. 

Next, we give the forms of AIC and SC to be used in the standard normal
mixture model approach to clustering.

For the standard mixture model, we first consider our conjecture in
(3.8) and show the form of AIC under this conjecture by stating and proving
the following theorem.

Theorem 5.1. If $-2\ln\lambda \overset{a.d.}{\sim} \chi^{2}_{f}(\delta)$
(noncentral chi-square) with f = 2(M-m) degrees of freedom, then

(5.1)  AIC*(k) = -2C ln[max L(k)] + 3m(k),

where

$C = \frac{1}{n}\left(n - 1 - p - \frac{K}{2}\right)$ = correction factor,

k = 1,2,...,K = number of component clusters, or types,

m = m(k),

$m(k) = kp + (k-1) + \frac{p(p+1)}{2}$ = number of parameters including the
mixture proportions when covariances are equal,

$m(k) = kp + (k-1) + k\,\frac{p(p+1)}{2}$ = number of parameters including the
mixture proportions when covariances are different between clusters, and

M = m(K).


Proof. In general,

(5.2)  $-2nE[B(f;f)] = -2nE[\text{entropy}] = \delta + m$,

where E denotes the expected value,
$\delta = n\,\|\hat{\theta} - \theta_{\text{true}}\|_{J}^{2}$ is the
noncentrality parameter, $\|\cdot\|_{J}$ stands for the Euclidean norm with
respect to $J = (J_{ij})$, the (k x k) Fisher information matrix, and m
denotes here the number of parameters. We asserted in (3.8) that

(5.3)  $-2\ln\lambda \overset{a.d.}{\sim} \chi^{2}_{f}(\delta)$,

where f = 2(M-m) is the number of degrees of freedom, and $\delta$ is the
noncentrality parameter. As is well known,

(5.4)  $-2C\ln\lambda = E[-2C\ln\lambda] = E[\chi^{2}_{f}(\delta)] = \delta + f = \delta + 2(M-m)$.

Hence, solving (5.4) for $\delta$, the noncentrality parameter, we have

(5.5)  $\delta = -2C\ln\lambda - 2(M-m)$.

Now substituting (5.5) into (5.2), we obtain

(5.6)  $-2nE[B(f;f)] = \delta + m = -2C\ln\lambda - 2(M-m) + m = -2C\ln\lambda - 2M + 3m$.

Since

(5.7)  $\ln\lambda = \ln\frac{\max L(k)}{\max L(K)} = \ln[\max L(k)] - \ln[\max L(K)]$,

and since AIC estimates the quantity $-2nE[B]$, then from (5.6), we have

(5.8)  AIC = -2C ln[max L(k)] + 3m - 2M + 2C ln[max L(K)].

For comparison purposes, it suffices to ignore the additive terms -2M and
2C ln[max L(K)] in (5.8). Thus, for the standard mixture model AIC in (5.8)
takes the simple form

(5.9)  AIC*(k) = -2C ln[max L(k)] + 3m.

To make AIC*(k) compatible with SC(k), we can even drop C, the correction
factor, and use

(5.10)  AIC*(k) = -2 ln[max L(k)] + 3m.

As we mentioned before, stimulated by the appearance of Akaike's
Information Criterion (AIC), Schwarz (1978) has recommended the model selection
criterion,

(5.11)  SC(k) = -2 ln[max L(k)] + m(k)ln(n),

where k = 1,2,...,K = number of component clusters, or types,

m = m(k),

$m(k) = kp + (k-1) + \frac{p(p+1)}{2}$ = number of parameters including the
mixture proportions when covariances are equal,

$m(k) = kp + (k-1) + k\,\frac{p(p+1)}{2}$ = number of parameters including the
mixture proportions when covariances are different between clusters, and

M = m(K)

for the standard mixture model.
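
The parameter counts and the two criteria are simple enough to compute directly. The following is a minimal sketch, not the author's program: the function names are hypothetical, AIC* is taken in the form without the correction factor C, and the k=1 check values are the ones reported for the iris data in Section 6.

```python
import math

def num_params(k, p, equal_cov):
    """m(k): k mean vectors, k-1 free mixture proportions, covariance terms."""
    cov_terms = p * (p + 1) // 2      # free entries of one symmetric p x p matrix
    n_cov = 1 if equal_cov else k     # one shared matrix, or one per cluster
    return k * p + (k - 1) + n_cov * cov_terms

def aic_star(loglik, m):
    """AIC*(k) = -2 ln[max L(k)] + 3 m(k), without the correction factor."""
    return -2.0 * loglik + 3.0 * m

def schwarz(loglik, m, n):
    """SC(k) = -2 ln[max L(k)] + m(k) ln(n)."""
    return -2.0 * loglik + m * math.log(n)

p, n = 4, 150
for k in (1, 2, 3):
    print(k, num_params(k, p, equal_cov=True), num_params(k, p, equal_cov=False))

m1 = num_params(1, p, equal_cov=False)
print(round(aic_star(171.448, m1), 3), round(schwarz(171.448, m1, n), 3))
```

With p = 4 the counts grow by 5 per added cluster under equal covariances and by 15 under different covariances, so the penalty term rises much faster in the second case.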

Having defined these two well-known model-selection criteria for the
standard normal mixture model, in the next section, Section 6, we apply these
two criteria to the famous Fisher iris data. In doing so, we shall attempt to
improve Wolfe's and others' results without the worry of what the appropriate
significance level $\alpha$ should be in testing the hypothesis of different
component clusters in order to discover or identify and describe the clusters
or types in the mixture model.

6.  Application  of  Standard  Normal  Mixture  Model  to  Fisher  Iris  Data 

In this section we shall apply the standard normal mixture model to the
well-known Fisher (1936) iris data. We shall give the numerical results from
the mixture model by performing different analyses on the iris data, applying
the model-selection criteria for different choices of k. We shall accomplish
this by using the mixture algorithm under two assumptions: common
covariance matrices between the component normals, and varying covariance
matrices, in determining the actual number of types or species in the Fisher
iris data.

The iris data consist of four characteristics (p=4) for three species of
iris; the species are Iris setosa (S), Iris versicolor (Ve), and Iris
virginica (Vi), and the characteristics are sepal length, sepal width, petal
length, and petal width. Each group is represented by 50 plants, and hence
this data set is composed of 150 iris plants in total.


This data set has been quite extensively studied in classification and
cluster analysis since it was published by Fisher (1936), and still today
is being used to test the practical utility of various classification and
clustering methods proposed by many investigators such as Friedman and Rubin
(1967), Kendall (1966), Solomon (1971), Mezzich and Solomon (1980), and many
others, including the present author.

For  each  of  the  150  plants  we  already  know  the  group  structure  of  the 
iris  species,  namely  K=3  groups  or  samples.  Even  though  the  two  species,  Iris 
setosa  and  Iris  versicolor  were  found  growing  in  the  same  colony,  and  Iris 
virginica  was  found  growing  in  a  different  colony,  Fisher  reports  in  his 
linear  discriminant  analysis  the  separation  of  I.  setosa  completely  from 
I.  versicolor  and  I.  virginica.  Since  then  other  investigators  have  shown 
similar  results  in  their  studies  such  as  the  ones  we  mentioned  above. 

With this in mind, for our purposes, if we were presented with the 150
irises in an unclassified manner (say, before the three species were
established), then the mixture analysis using model-selection criteria
attempts to discover and describe the types of irises without using any
a priori classification information.

Using the NORMIX programs (i.e., normal mixture programs) of Wolfe
(1967), which are modified and extended by this author, on the Fisher iris
data, we ran normal mixtures with different covariance matrices between the
clusters (i.e., types), and normal mixtures with common covariance matrices.

In both cases, we ran k=1,2,...,7 types and computed AIC*(k)'s and SC(k)'s for
identifying the best component cluster or clusters under the following
situations:

1. When the mixture algorithm initially partitions the data into equal
size groups;

2. When the data are initially reordered to make the problem difficult for
the mixture algorithm;

3. When the results from the k-means algorithm are used to initialize the
mixture algorithm to avoid the problem of local maxima of the
likelihood function;

4. When a special initialization scheme, proposed by this author, is used
to initialize the mixture algorithm; and finally

5. When a special initialization scheme is used on the reordered data to
start the mixture algorithm, again to avoid the problem of local
maxima of the likelihood function.

We  present  all  our  numerical  results  under  each  of  the  above  situations 
respectively,  as  follows. 

6.1. When Data Initially Partitioned into Equal Size Groups

When no special initialization is used, the mixture algorithm in the
first step of iteration sets the belonging probabilities equal to one. That
is, P(k|x_i) = 1 when the individual i is from component (or group) k and zero
otherwise. This initialization is equivalent to partitioning the observations
into equal size groups. Then the algorithm estimates the number of
observations from the kth component in the second step. In the third and
fourth steps, the algorithm estimates the cluster means and the within-cluster
variance-covariance matrices, respectively. In the fifth step, the
determinants and inverses of the variance-covariance matrices are computed for
each k, followed by the probability densities, the average densities, and the
log likelihood function. This cycle is repeated until the maximum-likelihood
estimates of the parameters converge, and until all the individuals or data
units are assigned into their respective component clusters and no further
reallocation occurs.
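
The first three steps described above can be sketched as follows. This is an illustrative reconstruction, not the NORMIX code: the function names are hypothetical, and it assumes the equal-size groups are consecutive blocks of the data in their stored order, which the text does not state explicitly.

```python
def equal_size_partition(n, k):
    """Step 1: hard 0/1 belonging probabilities for k consecutive blocks."""
    labels = [i * k // n for i in range(n)]   # block index of each observation
    return [[1.0 if labels[i] == j else 0.0 for j in range(k)] for i in range(n)]

def first_estimates(data, resp):
    """Steps 2 and 3: estimated group sizes and cluster mean vectors."""
    n, k, p = len(data), len(resp[0]), len(data[0])
    counts = [sum(resp[i][j] for i in range(n)) for j in range(k)]
    means = [[sum(resp[i][j] * data[i][d] for i in range(n)) / counts[j]
              for d in range(p)]
             for j in range(k)]
    return counts, means

# Tiny made-up data set with p = 2 variables, for illustration only.
data = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
resp = equal_size_partition(len(data), 2)
counts, means = first_estimates(data, resp)
print(counts)
print(means)
```

Because the blocks follow the stored order of the observations, changing that order changes the starting estimates, which is the effect exploited in Section 6.2.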


Under this situation, we ran k=1, k=2, k=3, k=4, k=5, k=6, and k=7
components or types and computed AIC*(k)'s and SC(k)'s for identifying and
selecting the best component cluster or clusters. We obtained the following
results.


TABLE 6.1.1. THE AIC*(k)'s AND SC(k)'s FOR STANDARD MIXTURE MODEL FOR THE
IRIS DATA WHEN COVARIANCE MATRICES ARE DIFFERENT BETWEEN CLUSTERS

No. of Types    ln[max L(k)]^a    No. of Parameters    AIC*(k)^c       SC(k)^d
     k                                  m^b
------------    --------------    -----------------    ------------    ------------
     1             171.448               14             -300.896        -272.748
     2             337.008               29             -587.016        -528.709*
     3             371.177               44             -610.354*       -521.887**
     4             385.342               59             -594.684**      -476.057
     5             397.178               74             -572.356        -423.567
     6             436.148               89             -605.296        -426.349
    K=7            439.528              104             -567.056        -357.950

Where p=4 Variables, n=150 Observations, and

a. From Iterative Maximum Likelihood Estimates in Mixture Model
After Convergence Took Place when 36 Iterations were used.

b. m = kp + (k-1) + kp(p+1)/2 = Number of Parameters.

c. AIC*(k) = -2 ln[max L(k)] + 3m.

d. SC(k) = -2 ln[max L(k)] + m ln(n).

* First Minimum AIC* and SC.

** Second Minimum AIC* and SC.


TABLE 6.1.2. THE AIC*(k)'s AND SC(k)'s FOR STANDARD MIXTURE MODEL FOR THE IRIS
DATA WHEN COVARIANCE MATRICES ARE EQUAL BETWEEN CLUSTERS

No. of Types    ln[max L(k)]^a    No. of Parameters    AIC*(k)^c       SC(k)^d
     k                                  m^b
------------    --------------    -----------------    ------------    ------------
     1             171.448               14             -300.896        -272.748
     2             254.915               19             -452.830        -414.629
     3             295.009               24             -518.018        -469.763
     4             328.314               29             -569.628**      -511.321*
     5             334.076               34             -566.152        -497.791**
     6             339.142               39             -561.284        -482.370
    K=7            355.353               44             -578.706*       -490.176

Where p=4 Variables, n=150 Observations, and

a. From Iterative Maximum Likelihood Estimates in Mixture Model
After Convergence Took Place when 36 Iterations were used.

b. m = kp + (k-1) + p(p+1)/2 = Number of Parameters.

c. AIC*(k) = -2 ln[max L(k)] + 3m.

d. SC(k) = -2 ln[max L(k)] + m ln(n).

* First Minimum AIC* and SC.

** Second Minimum AIC* and SC.

Examining each table carefully, starting with Table 6.1.1 where the
covariance matrices are different between clusters (or types), we see that the
first minimum AIC* is when k=3 types, and the second minimum AIC* is when k=4
types. That is, when k=3 types we have the best mixture submodel. This
indicates that there are indeed three types of species in the iris data. On
the other hand, the first minimum SC is when k=2 types, and the second minimum
SC is when k=3 types. Thus, according to SC, k=2 types is the best mixture
submodel, indicating the fact that SC favors lower-dimensional models when
compared with Akaike's AIC*. Nevertheless, the second minimum SC is when k=3
types, where AIC* also achieves its first minimum. Hence, the mixture model has
recovered the known structure among the 150 iris plants and we are capable of
identifying it by using the minimum AIC* and the minimum SC procedures. For
the three-types solution, by examining the confusion matrix of group
membership, we see further that the I. setosa (Type or Cluster 1) were
completely recovered, as were I. virginica (Type or Cluster 3). However, five
plants of I. versicolor (Type or Cluster 2) were classified with Type 3 and
therefore these could be regarded as misclassified.
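
The minimum AIC* and minimum SC procedures amount to scanning the candidate values of k for the smallest criterion value. The sketch below illustrates this with the log likelihoods and parameter counts printed in Table 6.1.1; the variable and function names are hypothetical, and only the position of each minimum is used, since recomputed values need not match the printed criteria digit for digit.

```python
import math

# k: (ln[max L(k)], m) as printed in Table 6.1.1 (different covariance matrices).
table_6_1_1 = {
    1: (171.448, 14), 2: (337.008, 29), 3: (371.177, 44), 4: (385.342, 59),
    5: (397.178, 74), 6: (436.148, 89), 7: (439.528, 104),
}
n = 150  # number of observations

def criteria(loglik, m, n):
    """Return (AIC*, SC) for one fitted mixture."""
    aic_star = -2.0 * loglik + 3.0 * m
    sc = -2.0 * loglik + m * math.log(n)
    return aic_star, sc

scores = {k: criteria(ll, m, n) for k, (ll, m) in table_6_1_1.items()}
best_aic_k = min(scores, key=lambda k: scores[k][0])   # smallest AIC*
best_sc_k = min(scores, key=lambda k: scores[k][1])    # smallest SC
print("minimum AIC* at k =", best_aic_k)
print("minimum SC   at k =", best_sc_k)
```

No significance level enters at any point; the choice of k follows from the ordering of the criterion values alone.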

In Table 6.1.2 where the covariance matrices are considered to be equal
between clusters (or types), we see that the first minimum AIC* is when k=7
types, and the second minimum AIC* is when k=4 types. On the other hand, SC
favors k=4 first, and then k=5 to be the second best mixture submodel. These
results are not surprising since the population covariance matrices of the
three types of irises are not equal to each other. Moreover, since mixture
analysis attempts to find maximum-likelihood estimates of the parameters, the
best solution for our purposes is the one with the greatest likelihood, or the
greatest log likelihood. Hence, if we compare ln[max L(k)] of Table 6.1.1
and Table 6.1.2, respectively, we see that we have the greatest log likelihoods
for each component cluster in Table 6.1.1, except when k=1 of course. Thus,
this suggests that we should use the results of Table 6.1.1 where the
covariance matrices are different for the iris data.

6.2. When Data Initially Reordered

In this case, we made the problem intentionally harder for the mixture
algorithm through the reordering of the iris data sequentially. We chose the
first three plants from each group and sequentially reordered the data until
all the 150 flowers were scrambled completely. Such reordering of the data
makes the algorithm start at different initial estimates of the parameters.
The purpose of doing this is to obtain satisfactory initial estimates of the
parameters, which are essential if we need to avoid misleading solutions.
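
One possible reading of this reordering is a round-robin interleave in blocks of three. The sketch below is an assumption about the procedure, which the text does not spell out; the function name and the integer stand-ins for the plants are hypothetical.

```python
def interleave_groups(groups, block=3):
    """Take `block` items from each group in turn until every item is used."""
    order = []
    pos = 0
    longest = max(len(g) for g in groups)
    while pos < longest:
        for g in groups:
            order.extend(g[pos:pos + block])   # a short final slice is allowed
        pos += block
    return order

# Stand-in indices: 0-49 setosa, 50-99 versicolor, 100-149 virginica.
groups = [list(range(0, 50)), list(range(50, 100)), list(range(100, 150))]
order = interleave_groups(groups)
print(order[:9])
print(len(order))
```

Under this reading, any initial split of the reordered data into consecutive equal-size blocks mixes all three species in every block, so the starting estimates carry no information about the true groups.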


We ran again the NORMIX program assuming both different and equal
covariance matrices between the clusters (or types) for k=1, k=2, k=3, k=4,
k=5, k=6, and k=7 types. For each of the clustering alternatives, we computed
AIC*(k)'s and SC(k)'s to be able to identify the best type and consequently
determine the exact number of types. Our results are shown in Tables
6.2.1 and 6.2.2.

TABLE 6.2.1. THE AIC*(k)'s AND SC(k)'s FOR STANDARD MIXTURE MODEL FOR THE IRIS
DATA WHEN COVARIANCE MATRICES ARE DIFFERENT BETWEEN CLUSTERS

No. of Types    ln[max L(k)]^a    No. of Parameters    AIC*(k)^c       SC(k)^d
     k                                  m^b
------------    --------------    -----------------    ------------    ------------
     1             171.448               14             -300.896        -272.748
     2             254.235               29             -427.470        -369.162
     3             361.859               44             -591.713*       -503.251*
     4             376.186               59             -575.372**      -456.745**
     5             380.982               74             -539.964        -391.177
     6             245.141#              89             -223.232#        -44.337#
    K=7            426.002              104             -540.004        -330.997

* First Minimum AIC* and SC.

** Second Minimum AIC* and SC.

# AIC* and SC Values During 5th Iteration. Mixture Algorithm Halted at
6th Iteration. Singular Variance-Covariance Matrix.


TABLE 6.2.2. THE AIC*(k)'s AND SC(k)'s FOR STANDARD MIXTURE MODEL FOR THE IRIS
DATA WHEN COVARIANCE MATRICES ARE EQUAL BETWEEN CLUSTERS

No. of Types    ln[max L(k)]^a    No. of Parameters    AIC*(k)^c       SC(k)^d
     k                                  m^b
------------    --------------    -----------------    ------------    ------------
     1             171.448               14             -300.896        -272.446
     2             191.137               19             -325.274*       -287.072*
     3             191.137               24             -310.274**      -262.018**
     4             191.137               29             -295.274        -236.965
     5             182.611               34             -263.222        -194.861
     6             191.137               39             -265.274        -186.859
    K=7            191.136               44             -250.272        -161.806

* First Minimum AIC* and SC.

** Second Minimum AIC* and SC.

a, b, c, and d are as in Tables 6.1.1 and 6.1.2.


Now examining Tables 6.2.1 and 6.2.2, we see in Table 6.2.1 that the first
minimum AIC* and SC occur at k=3 types, and the second minimum AIC* and SC
occur at k=4 types. Thus, both criteria choose k=3 types as the best mixture
submodel.

In Table 6.2.2, however, we see completely the opposite of the results in
Table 6.2.1. Here, the first minimum AIC* and SC both occur at k=2 types, and
the second minimum AIC* and SC occur at k=3 types. We note, however, that
ln[max L(k)], except for k=1, has converged to the same value for k=2,3,...,7
types even when we used 36 iterations. That is, ln[max L(k)] for k=2,...,7 are
all stationary. Again, since mixture analysis attempts to find
maximum-likelihood estimates of the parameters, the best solution for our
purposes is the one with the greatest likelihood, or the greatest log
likelihood. Therefore, comparing ln[max L(k)] of Tables 6.2.1 and 6.2.2, we see
that the ln[max L(k)] values are the greatest for each component cluster in
Table 6.2.1, except when k=1. This suggests again that we should use the
results of Table 6.2.1 where the covariance matrices are different for the
iris data. However, one should not be puzzled by the nonconvergence of
ln[max L(k)] in Table 6.2.2, since we are not always guaranteed convergence in
iterative procedures, nor are we guaranteed that the local optimum is always
global. We show such a result to demonstrate that unexpected things also might
happen.

6.3.  When  Data  Initialized  by  K-Means  Algorithm 

It is a well-known fact among the users of cluster analysis techniques
that in the multivariate situation satisfactory or good initial estimates for
the parameters are almost essential to start the iterative clustering
algorithms to avoid misleading solutions. Especially in the mixture analysis,
there may be many different solutions of the maximum likelihood equations.
Therefore, suitable initial values for the parameters are crucial when fitting
mixtures of multivariate normal distributions to data to avoid the problem of
local maxima of the likelihood function.

In the literature, Hartigan (1975, p. 124), Everitt (1981), and others
suggest that the "k-means" algorithm be applied to the data first, and that
the resulting cluster centroids (or means), etc., be taken as starting values
for component mean vectors, etc., in the maximum likelihood estimation
algorithm. Following their suggestions, we ran the "k-means" algorithm by
using the BMDP K-MEANS procedure and asked for k=1,2,...,7 clusters on the 150
iris plants. We then took the resulting cluster centroids for each k and used
them as starting values for component mean vectors in the mixture analysis for
k=1,2,...,7.
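
The idea of feeding k-means centroids to the mixture algorithm can be sketched with plain Lloyd iterations. This is a stand-in for illustration only: it is not the BMDP procedure, whose internal details are not given here, and the choice of the first k observations as initial centroids, the function name, and the toy data are all assumptions.

```python
def kmeans_centroids(data, k, iters=20):
    """Plain Lloyd iterations; the first k points serve as initial centroids."""
    centroids = [list(x) for x in data[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            # squared Euclidean distance from x to every current centroid
            d2 = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
            clusters[d2.index(min(d2))].append(x)
        for j, members in enumerate(clusters):
            if members:   # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

# Made-up two-variable data with two well-separated groups.
data = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0],
        [10.0, 11.0], [1.0, 0.0], [11.0, 10.0]]
start_means = kmeans_centroids(data, 2)
print(start_means)   # candidate starting values for the component mean vectors
```

Only the mean vectors are initialized this way; the mixture proportions and covariance matrices would still have to be estimated by the mixture algorithm itself.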

We obtained the following results.

TABLE 6.3.1. THE AIC*(k)'s AND SC(k)'s FOR STANDARD MIXTURE MODEL FOR THE IRIS
DATA WHEN COVARIANCE MATRICES ARE DIFFERENT BETWEEN CLUSTERS

No. of Types    ln[max L(k)]^a    No. of Parameters    AIC*(k)^c       SC(k)^d
     k                                  m^b
------------    --------------    -----------------    ------------    ------------
     1             171.448               14             -300.896        -272.748
     2             337.008               29             -587.016**      -528.709*
     3             358.709               44             -585.418**      -496.950**
     4             314.804#              59             -452.608#       -333.981#
     5             412.012               74             -602.024*       -453.237
     6             393.591#              89             -520.182#       -341.236#
    K=7            391.616#             104             -471.232#       -252.125#

TABLE 6.3.2. THE AIC*(k)'s AND SC(k)'s FOR STANDARD MIXTURE MODEL FOR THE IRIS
DATA WHEN COVARIANCE MATRICES ARE EQUAL BETWEEN CLUSTERS

No. of Types    ln[max L(k)]^a    No. of Parameters    AIC*(k)^c       SC(k)^d
     k                                  m^b
------------    --------------    -----------------    ------------    ------------
     1             171.448               14             -300.896        -272.748
     2             254.915               19             -452.830        -414.629
     3             295.001               24             -518.002        -469.763
     4             328.314               29             -569.628**      -511.320*
     5             334.065               34             -566.130        -497.768**
     6             339.119               39             -561.238        -482.824
    K=7            352.781               44             -573.562*       -485.095

* First Minimum AIC* and SC.
** Second Minimum AIC* and SC.

# AIC* and SC Values During 5th Iteration. Mixture Algorithm Halted at
6th Iteration. Singular Variance-Covariance Matrix.

a, b, c, and d are as in Tables 6.1.1 and 6.1.2.

Looking at Tables 6.3.1 and 6.3.2, we see in Table 6.3.1 that the first
minimum AIC* occurs at k=5 types and the first minimum SC occurs at k=2 types.
The second minimum AIC* occurs at k=2 types and at k=3, since the values are
very close to each other. Also, the second minimum SC occurs at k=3
types. For k=4, k=6, and k=7 types, the mixture algorithm halted at the 6th
iteration due to a singular variance-covariance matrix.

In Table 6.3.2, we see that the first minimum AIC* occurs at k=7 types
and the second minimum AIC* occurs at k=4 types. On the other hand, the first
minimum SC occurs at k=4 types and the second minimum SC occurs at k=5 types.
We further note here that these results are identical to those obtained in
Table 6.1.2, when the data are initially partitioned into equal size groups by
the algorithm.

Even though using "k-means" or other clustering techniques as a tool for
initializing clusters appears to be the most obvious way to obtain suitable
initial values for the parameters in the mixture analysis, such an
approach in general may not be the best, as we shall see in the next two
sections, that is, in Sections 6.4 and 6.5, respectively.

6.4. When Data Initialized by Special Initialization Scheme

In Section 6.3, we gave the results of the mixture analysis when we
initialized the mixture algorithm by using the results of the "k-means"
algorithm as our inputs or starting values for component mean vectors. As we
mentioned, such an approach in general may be neither the best nor the
cheapest. Therefore, in this section, we shall propose a simple and less
expensive initialization scheme which has intuitive appeal and by and large is
philosophically acceptable.

The  proposed  initialization  scheme  is  as  follows: 

(i) We first compute the maximum and the minimum of the variables across
all data. We denote these by $X_{\max}$ and $X_{\min}$. Let
$R = X_{\max} - X_{\min}$ be the range of the data on the variable
vector X.

(ii) Next, we compute the average of $X_{\min}$ and $X_{\max}$. We denote
this by $X_{11} = (X_{\min} + X_{\max})/2$. To initialize the k=1
component mixture, we use $X_{11}$ as the component mean vector in the
mixture analysis.

(iii) To initialize k=2 component mixtures, we compute
$X_{21} = (X_{\min} + X_{11})/2$ and $X_{22} = (X_{11} + X_{\max})/2$ to be
entered as the component mean vectors in the mixture analysis.

(iv) To initialize k=3 component mixtures, we compute
$X_{31} = (X_{\min} + X_{21})/2$, $X_{32} = (X_{21} + X_{22})/2$, and
$X_{33} = (X_{22} + X_{\max})/2$ to be entered as the component mean
vectors in the mixture analysis, and so on.


Thus, we continue in this fashion until we generate all the initial mean
vectors sequentially, and until we reach the largest hypothesized number of
component clusters K. In doing this, we remain in the range of the data on the
variable vector X. Such an initialization scheme sets up cluster centers
regularly spaced at intervals on each variable, which is less expensive and
easy to program. Of course, we can also consider outer points (i.e., the points
outside of the data range) and use the above initialization scheme to
initialize the mixture and other clustering algorithms, which we did not
pursue here.
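
The scheme is indeed short to program. The following is a minimal sketch under one stated assumption: the "and so on" for k > 3 is read as taking midpoints of consecutive entries of the list formed by the minimum, the previous level's centers, and the maximum, which reproduces steps (ii) to (iv). The function name and the numeric example are hypothetical.

```python
def special_init_means(x_min, x_max, K):
    """Return {k: list of k initial mean vectors} for k = 1, ..., K."""
    levels = {}
    prev = []                                   # centers generated for k - 1
    for k in range(1, K + 1):
        anchors = [x_min] + prev + [x_max]      # k + 1 points, so k midpoints
        centers = [[(a + b) / 2.0 for a, b in zip(lo, hi)]
                   for lo, hi in zip(anchors, anchors[1:])]
        levels[k] = centers
        prev = centers
    return levels

# Made-up variable-wise minimum and maximum for p = 2 variables.
x_min, x_max = [0.0, 0.0], [8.0, 16.0]
levels = special_init_means(x_min, x_max, 3)
for k, centers in levels.items():
    print(k, centers)
```

Every generated center is an average of points inside the data range, so the starting means never leave the range, and only the variable-wise extremes of the data are needed as input.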

Our results obtained from this special initialization scheme are shown in
Tables 6.4.1 and 6.4.2.

TABLE 6.4.1. THE AIC*(k)'s AND SC(k)'s FOR STANDARD MIXTURE MODEL FOR THE IRIS
DATA WHEN COVARIANCE MATRICES ARE DIFFERENT BETWEEN CLUSTERS

No. of Types    ln[max L(k)]^a    No. of Parameters    AIC*(k)^c       SC(k)^d
     k                                  m^b
------------    --------------    -----------------    ------------    ------------
     1             171.448               14             -300.896        -272.748
     2             337.008               29             -587.016**      -528.709*
     3             371.177               44             -610.234*       -521.887**
     4             381.395               59             -585.790        -467.163
     5             405.493               74             -588.986        -440.200
     6             426.428               89             -585.856        -406.911
    K=7            433.193              104             -554.386        -345.279

TABLE  6.4.2. 

THE  AIC*(k ) 1  s  AND  SC(k)‘s  FOR  STANDARD  MIXTURE  .MODEL  FOR  THE  IRIS 

DATA  'WHEN  COVARIANCE  MATRICES  ARE  EQUAL  BETWEEN  CLUSTERS 

No.  of  Types 

In  [max  L(k)]a 

No.  ofhParameters 

A I C  *  ( k  ) C 

;  sc(k)d 

k 

mD 

I 

1 

171.448 

14 

-300.896 

-272.748 

2 

254.915 

19 

-452.830 

-414.629 

3 

295.009 

24 

-513.018 

-469.763 

4 

315.296 

29 

-543.592 

!  -485.284 

1 

5 

333.998 

34 

j  -565.996** 

;  -497.535* 

6 

341,242 

39 

i  -565.  448 

!  -487.070 

K=7 

355.  339 

|  44 

-578.678* 

1 

;  -490.210** 

i 

*  First  Minimum  AIC*  and  SC 

**  Second  Minimum  AIC*  and  SC 


a,  b,  c,  and  d  are  as  in  Tables  6.1.1  and  6.1.2. 


L  j 
Examining each table carefully, starting with Table 6.4.1 where the
covariance matrices are different between clusters (or types), we see that the
first minimum AIC* occurs at k=3 types and the second minimum AIC* occurs at
k=2 types. That is, according to AIC*, k=3 types gives the best mixture
submodel. On the other hand, the first minimum SC occurs at k=2 types, and the
second minimum SC occurs at k=3 types. Thus, according to SC, k=2 types is the
best mixture submodel. Comparing these results with the results of the mixture
analysis obtained by initializing the mixture algorithm with the "k-means"
results, given in Table 6.3.1, we see that our initialization scheme gives
better results than the initialization suggested in the literature.
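
As an illustration of how the entries of Table 6.4.1 relate to one another, the following Python sketch recomputes both criteria from the tabulated ln[max L(k)] and m. It rests on assumptions that are not stated in this section: the forms AIC*(k) = -2 ln[max L(k)] + 3m and SC(k) = -2 ln[max L(k)] + m ln n are inferred from the printed values, and n = 150 is taken as the size of the Fisher iris data. The function and variable names are hypothetical.

```python
import math

N_OBS = 150  # assumption: sample size of the Fisher iris data


def aic_star(log_lik, m, penalty=3.0):
    """-2 ln L + penalty * m; penalty = 3 is inferred from the table values."""
    return -2.0 * log_lik + penalty * m


def schwarz(log_lik, m, n=N_OBS):
    """-2 ln L + m ln n."""
    return -2.0 * log_lik + m * math.log(n)


# (k, ln[max L(k)], m) copied from Table 6.4.1
TABLE_6_4_1 = [
    (1, 171.448, 14),
    (2, 337.008, 29),
    (3, 371.177, 44),
    (4, 381.395, 59),
    (5, 405.493, 74),
    (6, 426.428, 89),
    (7, 433.193, 104),
]


def best_k(criterion, rows):
    """Return the k whose criterion value is smallest."""
    return min(rows, key=lambda row: criterion(row[1], row[2]))[0]


best_aic_k = best_k(aic_star, TABLE_6_4_1)
best_sc_k = best_k(schwarz, TABLE_6_4_1)

for k, log_lik, m in TABLE_6_4_1:
    print(f"k={k}  AIC*={aic_star(log_lik, m):9.3f}  SC={schwarz(log_lik, m):9.3f}")
```

Under these assumptions the recomputed minima fall at k=3 for AIC* and at k=2 for SC, in line with the starred entries of the table. Small differences from individual printed values may reflect rounding or transcription.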

In Table 6.4.2, where the covariance matrices are considered to be equal
between clusters (or types), we see that the first minimum AIC* occurs at k=7
types and the second minimum AIC* occurs at k=5 types. SC favors the same
mixture submodels but in the reverse order as compared to AIC*. Again, these
results are not surprising, since the population covariance matrices of the
three types of irises are not equal to each other, and the ln[max L(k)] values
in Table 6.4.1 are greater for every k than the corresponding ln[max L(k)]
values given in Table 6.4.2, except when k=1.
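
The difference between the parameter columns of Tables 6.4.1 and 6.4.2 can be sketched as follows. The counting rule used here is an assumption (it is the usual count for a p-variate normal mixture: k mean vectors, either one or k symmetric covariance matrices, and k-1 free mixing proportions), with p = 4 measurements per iris; it is offered only because it reproduces the m values printed in both tables. The function name is hypothetical.

```python
def n_params(k, p, equal_cov):
    """Free parameters of a k-component, p-variate normal mixture (assumed rule)."""
    means = k * p                      # one mean vector per component
    per_cov = p * (p + 1) // 2         # distinct entries of a symmetric p x p matrix
    covs = per_cov if equal_cov else k * per_cov
    weights = k - 1                    # mixing proportions sum to one
    return means + covs + weights


P = 4  # assumption: four measurements per flower in the iris data

m_different = [n_params(k, P, equal_cov=False) for k in range(1, 8)]
m_equal = [n_params(k, P, equal_cov=True) for k in range(1, 8)]

print(m_different)
print(m_equal)
```

With different covariance matrices each extra component costs 15 parameters, against 5 under a common covariance matrix, which is why the penalty terms grow so much faster in Table 6.4.1.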

6.5.  When Special Initialization Scheme is Used on Reordered Data

Finally, when we use the special initialization scheme presented in Section
5.4 on the reordered data to start the mixture algorithm, in order to avoid the
problem of local maxima of the likelihood function, we obtain the following
results.


TABLE 6.5.1. THE AIC*(k)'s AND SC(k)'s FOR STANDARD MIXTURE MODEL FOR THE IRIS
DATA WHEN COVARIANCE MATRICES ARE DIFFERENT BETWEEN CLUSTERS

No. of Types   ln[max L(k)]^a   No. of Parameters   AIC*(k)^c      SC(k)^d
     k                                 m^b

     1            171.448               14          -300.896       -272.748
     2            257.235               29          -427.470       -369.162
     3            358.219               44          -584.438*      -495.970*
     4            374.422               59          -571.884**     -453.217**
     5            220.659#              74          -219.318#       -70.532#
     6            218.458#              89          -169.916#         9.029#
    K=7           226.395#             104          -140.790#       -68.314#

TABLE 6.5.2. THE AIC*(k)'s AND SC(k)'s FOR STANDARD MIXTURE MODEL FOR THE IRIS
DATA WHEN COVARIANCE MATRICES ARE EQUAL BETWEEN CLUSTERS

No. of Types   ln[max L(k)]^a   No. of Parameters   AIC*(k)^c      SC(k)^d
     k                                 m^b

     1            171.448               14          -300.896       -272.748
     2            191.135               19          -325.270       -287.068
     3            295.009               24          -518.018*      -469.763*
     4            287.889               29          -488.778**     -430.470**
     5            171.531#              34          -241.062#      -172.701#
     6            171.559#              39          -226.118#      -147.704#
    K=7           171.576#              44          -211.152#      -122.685#

*  First Minimum AIC* and SC.
** Second Minimum AIC* and SC.

#  AIC* and SC Values During 5th Iteration. Mixture Algorithm Halted at
   5th Iteration. Singular Variance-Covariance Matrix.

a, b, c, and d are as in Tables 6.1.1 and 6.1.2.

Looking at Tables 6.5.1 and 6.5.2, we see that under both different and
equal covariance matrices between clusters (or types), the first minimum AIC*
and SC occur at k=3 types. The second minimum AIC* and SC occur at k=4 types.
Thus, in this case, according to both AIC* and SC, k=3 types is the best
mixture submodel. Comparing the values of AIC* and SC for k=2, 3, and 4 types
in Tables 6.5.1 and 6.5.2, respectively, we see that the AIC* and SC values in
Table 6.5.2 are larger than those in Table 6.5.1, suggesting that when we are
clustering the iris data, and in general, we should use different covariance
matrices rather than equal covariance matrices. Thus, model-selection criteria
can also be used to decide whether or not to assume a common covariance
matrix.
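
The comparison between the two covariance assumptions can be sketched in Python by placing the tabulated criteria side by side. The numbers are copied from Tables 6.5.1 and 6.5.2 for k = 2, 3, and 4; the dictionary and function names are hypothetical, and the rule "prefer the structure with the smaller criterion value" is simply the minimum-criterion principle applied across the two tables.

```python
# (AIC*, SC) for k = 2, 3, 4, copied from Table 6.5.1 (different covariances)
DIFFERENT = {
    2: (-427.470, -369.162),
    3: (-584.438, -495.970),
    4: (-571.884, -453.217),
}

# (AIC*, SC) for k = 2, 3, 4, copied from Table 6.5.2 (equal covariances)
EQUAL = {
    2: (-325.270, -287.068),
    3: (-518.018, -469.763),
    4: (-488.778, -430.470),
}


def preferred(k, index):
    """Covariance structure with the smaller criterion; index 0 = AIC*, 1 = SC."""
    return "different" if DIFFERENT[k][index] < EQUAL[k][index] else "equal"


choices = {k: (preferred(k, 0), preferred(k, 1)) for k in DIFFERENT}

for k, (by_aic, by_sc) in choices.items():
    print(f"k={k}: AIC* prefers {by_aic}, SC prefers {by_sc}")
```

For each of these three values of k, both criteria are smaller under different covariance matrices, which is the pattern the paragraph above describes.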

From the results in Tables 6.5.1 and 6.5.2, we further note that it
suffices to take K=5 as the hypothesized maximum number of mixture components
for the Fisher iris data, rather than fitting up to K=7 multivariate normal
mixture components.

7.  Conclusions  and  Discussion 

From our numerical results in Section 6, we see that model-selection
criteria can indeed be used to estimate k, the number of component clusters (or
types) in the mixture model, when we do not know the group structure of the
data a priori.

Summarizing the number of times the minimum AIC* and SC selected each
mixture submodel across all the tables given in Section 6, we obtain the
following frequencies.
TABLE 7.1. SUMMARY OF THE RESULTS OF AIC*(k)'s AND SC(k)'s FOR STANDARD
MIXTURE MODEL FOR THE IRIS DATA

No. of Types   Number of Times    Number of Times
     k         AIC*(k) Selected   SC(k) Selected

     1                0                  0
     2                2                  5
     3                4                  2
     4                0                  2
     5                1                  1
     6                0                  0
    K=7               3                  0

Looking at Table 7.1, we see that AIC* identifies the correct group
structure (i.e., k=3 types) in the Fisher iris data four times, as compared to
SC, which identifies the correct structure twice. AIC* chooses k=2 types twice,
whereas SC chooses k=2 types five times, indicating that SC favors
lower-dimensional models as compared to AIC*. The case where k=7 types was
chosen three times by AIC* corresponds to the results where the covariance
matrices between clusters were assumed to be equal instead of different. In
these applications, however, these criteria often agree in identifying the
correct model.
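
The tendency of SC toward lower-dimensional models can be read off Table 7.1 with a little arithmetic, sketched here in Python. The frequencies are copied from the table; the average selected k is a summary introduced only for this illustration and is not a quantity used in the paper.

```python
# k -> (times selected by AIC*, times selected by SC), copied from Table 7.1
TABLE_7_1 = {
    1: (0, 0),
    2: (2, 5),
    3: (4, 2),
    4: (0, 2),
    5: (1, 1),
    6: (0, 0),
    7: (3, 0),
}

total_aic = sum(a for a, _ in TABLE_7_1.values())
total_sc = sum(s for _, s in TABLE_7_1.values())

# Average number of types selected by each criterion (illustrative summary only)
mean_k_aic = sum(k * a for k, (a, _) in TABLE_7_1.items()) / total_aic
mean_k_sc = sum(k * s for k, (_, s) in TABLE_7_1.items()) / total_sc

# Share of analyses in which each criterion picked k = 3
share_correct_aic = TABLE_7_1[3][0] / total_aic
share_correct_sc = TABLE_7_1[3][1] / total_sc

print(total_aic, total_sc)
print(mean_k_aic, mean_k_sc)
print(share_correct_aic, share_correct_sc)
```

Each criterion was applied in ten analyses; the average k chosen by SC is lower than that chosen by AIC*, consistent with the remark that SC favors lower-dimensional models.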

In the literature, objections have been raised that minimizing the AIC*
does not produce an asymptotically consistent estimate of the model. For this,
we refer the reader to Schwarz (1978) and Bhansali and Downham (1977). But as
also mentioned by Larimore (1983), no strong reasons have been offered for why
such consistency would be desirable or would give sensible results generally,
since in most applications, such as the one we presented in this paper, we can
vary the class of alternative models but not the number of observations. As
Akaike (1981) states: ". . . This inconsistency of order determination does not
necessarily mean a serious problem, as expected deviation of the fitted model
in terms of entropy decreases to its minimum possible value as the data length
tends to infinity. This means that the procedure is inconsistent in terms of
our basic criterion. If AIC is replaced by

     -2 ln(maximized likelihood) + f(n)(number of free parameters),

where f(n) is a function which increases without bound, yet such that
f(n)/n -> 0 as n tends to infinity, then the corresponding MAICE produces a
consistent estimate of the order when this does exist."
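
The family of criteria in the quotation, -2 ln(maximized likelihood) + f(n)(number of free parameters), can be sketched in Python as below. The reading of the growth condition as "f(n) unbounded but growing more slowly than n" is our interpretation of the quoted passage, the constant 3 for AIC* is inferred from the tables rather than quoted, and all names are hypothetical. The example values are taken from the k=3 row of Table 6.4.1 with n = 150 assumed.

```python
import math


def generalized_criterion(log_lik, m, n, f):
    """-2 ln(maximized likelihood) + f(n) * (number of free parameters)."""
    return -2.0 * log_lik + f(n) * m


PENALTIES = {
    "AIC": lambda n: 2.0,           # constant penalty, does not grow with n
    "AIC*": lambda n: 3.0,          # assumption: factor inferred from the tables
    "SC": lambda n: math.log(n),    # grows without bound, more slowly than n
}

# k = 3 row of Table 6.4.1: ln[max L] = 371.177, m = 44; n = 150 assumed
values = {
    name: generalized_criterion(371.177, 44, 150, f)
    for name, f in PENALTIES.items()
}

# ln(n) keeps increasing while ln(n)/n shrinks toward zero
sizes = [10**2, 10**4, 10**6]
log_penalty = [math.log(n) for n in sizes]
ratios = [math.log(n) / n for n in sizes]

print(values)
print(log_penalty, ratios)
```

Only the ln n penalty in this sketch satisfies both conditions of the quotation; the constant penalties of AIC and AIC* do not increase with n, which is the source of the inconsistency objection discussed above.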

Therefore, consistency for a given class of models with a fixed number of
observations is not a problem for a good model-selection criterion. Especially
in classification and clustering problems, we do not have to worry about
consistency or the order of a model.

For example, from Table 7.1, we see that Schwarz's Criterion (SC), which is
a consistent modified version of AIC, does not necessarily pick up the correct
group structure more often than AIC* in the Fisher iris data, even when it is
known a priori that there are three species of irises. So the question is:
"What kind of penalty should the decision maker pay while trying to expect
consistency for the model when indeed no consistency problem exists in a
finite sample situation?"

Thus, it seems that to argue consistency when the data have a finite
sample size is fruitless. The performance of these model-selection criteria
most often depends strongly on the class of models, on the nature of the prior
specification corresponding to which these criteria are derived, and, of
course, on the type of data sets to which they are applied.

Thus, in concluding, we see that our numerical results clearly demonstrate
the potential of both Akaike's Information Criterion (AIC) and Schwarz's
Criterion (SC) in identifying the best clustering alternative or alternatives,
and in estimating the number of component clusters present in the mixture
model. These model-selection criteria are defined without any reference to a
particular null hypothesis and are measures of the badness of the model which
are free from the ambiguities inherent in the application of conventional
procedures.